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Abstract 

In high dimensions we propose and analyze an aggregation estimator of the precision matrix 
for Gaussian graphical models. This estimator, called graphical Exponential Screening (gES), 
linearly combines a suitable set of individual estimators with different underlying graphs, and 
balances the estimation error and sparsity. We study the risk of this aggregation estimator and 
show that it is comparable to that of the best estimator based on a single graph, chosen by 
an oracle. Numerical performance of our method is investigated using both simulated and real 
datasets, in comparison with some state-of-art estimation procedures. 


1 Introduction 

Graphical models [a [22] have become as a useful way of exploring and modeling the distribution. 
For instance, graphical models could be used to represent complex interactions among gene products 
resulted from biological processes. Such problems require us to infer an undirected graph from i.i.d. 
observations. 

Let X = {Xi,... ,Xp)^ be a random vector with some continuous distribution. An undirected 
graph G has p vertices, collected in a set V, one for each variable. We represent the edges as a 
set E of unordered pairs: {i,j) & E it and only if there is an edge between Xi and Xj. An edge 
between Xi and Xj is absent if Xi and Xj are independent, given the other variables. 

The default model for graphical modeling of continuous data is the multivariate Gaussian. Let 
data Vn := {xi^... ,Xn} he the realizations of n independent samples from a multivariate Gaussian 
distribution Afp{0, S), where E is the covariance matrix. Then the log-likelihood of is (up to a 
constant) given by 


T7 7^ 

^n(0) := 2 logdet(0) - -tr(E„0), (1) 

where 0 = E“^ is the precision matrix, i.e. inverse covariance matrix, and E„ = ^'^^^iXixf is 
the empirical covariance matrix. 

For Gaussian graphical models, it is well known that the edge between the ith and jth nodes 
is absent in the graph, meaning that the associated variables are conditionally independent given 
the other variables, if and only if 6ij — 0, where 0ij is the (i,j)th element of 0. Therefore, the 
estimation and model selection problems in Gaussian graphical models are equivalent to estimation 
of the precision matrix and identification of its zero-pattern. 
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While it is one of the classical problems in multivariate statistics, with a renewed focus on high¬ 
dimensional data in recent years, a number of sparse estimators have been proposed to deal with 
the problem of precision matrix estimation. Among them, Yuan and Lin |25] and Friedman et al. 
[5] impose an penalty on the entries of the precision matrix when maximizing the Gaussian log- 
likelihood, known as the graphical lasso, encouraging some of the entries of the estimated precision 
matrix to be exactly zero. Meinshausen and Biihlmann HU consider the neighborhood selection 
method via the lasso. Cai et al. [U propose a constrained minimization approach for sparse inverse 
covariance matrix estimation. Yuan |24j takes advantage of the connection between multivariate 
linear regression and entries of the inverse covariance matrix, developing an estimating procedure 
that can effectively exploit sparsity. Theoretical properties, including consistency in parameter 
estimation and sparsity structure recovery, are discussed in these and other papers [niiiH]. 

Given a collected family of estimators, linear or convex aggregation methods are another class of 
technique to address model selection problems and provide flexible ways to combine various models 
into a single estimator [16] . The idea of aggregating estimators was originally described in [14] . The 
suggestion put forward by m is to achieve two independent subsamples from the original sample 
by randomization: individual estimators are constructed from the first subsample while the second 
is used to perform aggregation on those individual estimators. This idea of two-step procedures 
carries over to models with i.i.d. observations where one can do sample splitting. Along with 
this method, one might aggregate estimators using the same observations for both estimation and 
aggregation. However, this would generally result in overfitting. 

A primary motivation for aggregating estimators is that it can improve the estimation risk, as 
“betting” on multiple models can provide a type of insurance against a single model being poor 
m- Most of the recent work on estimator aggregation deals with regression learning problems. For 
example. Exponential Screening (ES) for linear models provides a form of frequentist averaging over 
a large model class, which enjoys strong theoretical properties [16]. Also, an aggregation classifier 
is proposed in [U and an optimal rate of convex aggregation for the hinge risk is also obtained. 

In this paper, we propose a new estimating procedure by considering a linear combination of a 
suitable set of individual estimators with different sparsity patterns. A sparsity pattern is defined as 
a binary vector with each element indicating whether the corresponding edge of the graph is absent 
or not. These individual estimators and the corresponding aggregating weights are determined to 
ensure a competitive rate of convergence for the risk of the aggregation estimator. 

Our aggregation method is based on a sample-splitting procedure: the first subsample is set to 
construct individual estimators and the second subsample is then used to determine the weights and 
aggregate these estimators. To carry out the analysis of the aggregation step, it is enough to work 
conditionally on the first subsample so that the problem reduces to aggregation of deterministic 
estimators m- Namely, let R{-) denote a risk function, then given the deterministic estimator 0^ 
with sparsity pattern m, one can construct an aggregation estimator 0 such that the excess risk 

r := R{0) - min R{0m), (2) 

m^Ai 

is as small as possible, where Al is a candidate set of sparsity patterns. Ideally, we wish to find an 
aggregation estimator whose risk is as close as possible (in a probabilistic sense) to the minimum 
risk of individual estimators. The risk function considered in the paper is the Kullback-Leibler 
divergence. 

The rest of the paper is organized as follows. In Section[U we describe the graphical aggregation 
in detail. Theoretical properties are provided in Section]^ Numerical experiments are presented in 
Sections]^ Finally, we conclude in Section]^ Detailed proofs are collected in Appendix section. 
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2 Graphical Exponential Screening Estimator 

2.1 Notations and preliminaries 

Let I'll denote the vector norm, || • || denote the spectral matrix norm, and || • Hf denote the 
Frobenius matrix norm. For any symmetric positive-definite matrix A we use the notation A >- 0. 
For any two real numbers a and 6, we use the notation a\/ b := max(a, &). Let !(•) denote the 
indicator function. 

We call a sparsity pattern any binary vector m € A4, where is a candidate set of sparsity 
patterns and A4 C {0, The A:th coordinate of m can be interpreted as an indicator of 

presence {mk = 1) or absence (ruk = 0) of the edge {ik,jk) € E such that {ik,jk) is the fcth element 
of the ordered list Sp, where Sp = {{i,j) ■ 1 < i < j < p}- 

We partition the sample into two independent subsamples, Vn^ and , of size ni and n 2 , 
respectively, where ni -I- n 2 = n. 

2.2 Graphical aggregation method 

In the aggregation procedure, Vni is utilized to construct individual estimators and Vn] is then used 
to aggregate these estimators. Given each sparsity pattern to S AI, we first define the individual 
estimator of the precision matrix. 

Definition 2.1. For each m S AI, let be the edge set of the graph with sparsity pattern to. Let 
be the constrained maximum likelihood estimator, defined by 

= argmax {logdet(0) - tr(EW0)}, (3) 

Oec,,, 

where 


Cm = {Q F 0 ■■ Oij = 0 for any {i,j) ^ Em and i ^ j), (4) 

and denotes the empirical covariance matrix using the first subsample • 

Notice that each individual estimator maximizes the log-likelihood under the constraint that 
some pre-defined subset of the parameters are zero. If ni > qm where Pm is the maximal clique size 
of a minimal chordal cover of the graph with edge set Em, estimator Qm^ exists and is unique [19]. 
The following relationships hold regarding 0m^ and its inverse (0m^)“^: 

= 0, V(z, j) ^ Em, and (5) 

V(z,j) e EmU{i^,^),^ = l,...,p}. (6) 

Indeed we can drive the above properties via the Lagrange form, where we add Lagrange constants 
for all missing edges of Em 

^ni.m(0) :=logdet(0)-tr(SW0)- ^ (7) 

LAiErr. 

where is a Lagrange constant. The gradient equation for maximizing Q can be written as 

0-1 _ gu) _ r = 0, (8) 
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where F is a matrix of Lagrange parameters with nonzero values for all pairs with edges absent. 
It is an equality-constrained convex optimization problem, and a number of methods have been 
proposed for solving it, for example, in Hastie et al. [7j. 

Now we define the aggregation estimator, which linearly combines a set of individual estimators 
for m € AI. 

Definition 2.2. Let ©gEs^ graphical Exponential Screening (gES) estimator that linearly 

combines a set of individual estimators 0 m\ defined by 

®gES^ = H (9) 

mGA4 


where the superscripts denote which subsamples are used for constructions, and the weights 


exp{(n 2 / 2 )(logdet( 0 ^^) - tr( 0 ^^ 5 is^))} 7 r^ 
Em'eTW exp{(n 2 / 2 )(logdet( 0 ^)) - tr( 0 W 5 i^,^))} 7 r„/ 


( 10 ) 


(o') 

Here, Snf is an estimator of E, and tt^ is a (prior) probability distribution on the set of sparsity 
pattern A4, defined by 


. 1 / Hi y-i- 

H \ep{p-l)) 


( 11 ) 


where H is a normalization factor to ensure that tt^ add up to one. 

Observe that the gES estimator is a linear combination of the set of individual estimators 0m^ 
(1 2 ) • • 

with weights Vm ■ It is indeed a convex combination since the weights add up to one. 

A natural choice of Snf would be the empirical covariance matrix ■ In this scenario, ignoring 
the prior distribution tt^, the individual weight is proportional to the likelihood of estimator 0 m^ 
evaluated on the second subsample. Thus, the higher the predictive ability, the more weighting 
will be put on the corresponding individual estimator. However, as is shown in the next section, 
this would lead to a deterioration of convergence rate in high-dimensional settings. Instead, we can 
use the hard thresholding estimator proposed in Bickel and Levina [1]. To be more specific, the 
thresholding estimation of Tin) thresholded at 7 is defined by 

5(2) = T^(Si2)) := {aif • l(|ag)| > 7 )}i<.o<p, ( 12 ) 

where fiL is the (i, j)th element of T,nf. In practice, we can apply the following procedure for the 
problem of threshold selection [1]: we split the second subsample randomly into two pieces of size 
712(1 — l/log7T,2) and n 2 /logn 2 , respectively, and repeat this B times. Let „ and T,) ( be the 
empirical covariance matrices based on the two pieces, from the uth split. Then the thresholding 
parameter 7 is determined by 


7 = argmin 
7' 


B ^ 


||T.,.(Eg)-S(2)| 



(13) 
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In Definition |2.2[ we also incorporate a deterministic factor into the weighted averaging 
to account for (prior) model complexity, in a manner that facilitates desirable risk properties [101 
US]. Here, low-complexity models are favored. Alternatively, if the set of sparsity patterns A4 = 
{ 0 , the following deterministic factor specifies a uniform distribution on the cardinality 

of the subset and a conditionally uniform distribution on the subsets of that size: 


TT 


U 

m 


{p{p - l)/2 + 1) 


pip-i)/2y 

. I™li A 



(14) 


In addition, a simple way is to choose a flat prior, where we set for any m,m' G A4. 

We show in the next section that, under some technical conditions, the risk of the gES estimator 
is bounded by the risk of the best individual estimator plus a low price for aggregating. 


2.3 Approximation algorithm 

To implement the estimating procedure, note that exact computation of the aggregation estimator 
might require the calculation of as many as individual estimators. In many cases this 

number could be extremely large, and we must make a numerical approximation. Observing that 
the aggregation estimator is actually the expectation of a random variable that has a probability 
mass proportional to Vm’^^ on individual estimator for m G Ai, then the Metropolis-Hastings 
algorithm can be exploited to provide such an approximation. The detail algorithm is shown in 
Algorithm |2.I| 


Algorithm 2.1. If the set of sparsity patterns A4 = {0, then the gES estimator can he 

approximated by running a Metropolis-Hastings algorithm on a p{p — 1)/2-dimensional hypercube: 

(51) Initialize mt = {0}^^^, t — 0; 

(52) For each t>0, generate with the uniform distribution on the neighbours of mt; 

(53) Generate an [0, l]-uniformly distributed number r; 

(54) Put mt+i G- m't, if r < min{l,u)^; '/vml ’}; otherwise, mt+i G- mt; 

(55) Compute ©mt+i. Stop */< > Tq -I- T; otherwise, update t ^ t 1 and go to step (S2). 

Then we can approximate ©gEs^ by 

_ . To+T 

= A E 0^^ (15) 

To + 1 


where Tq and T are positive integers. 


Here, the neighbours of mt consists of all the sparsity patterns with a Manhattan distance of 
one to mt- The following proposition shows that the resulting Markov chain ensures the ergodicity. 


Proposition 2.1. The Markov chain {mt}o<t<To+T generated by Algorithm 2.1 satisfies 


To+T 


lim ^ H almost surely. 

1 —>^00 ± ' ^ ^ 


(16) 


To + 1 
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The proof is straightforward as the Markov chain is clearly ri^^’^^-irreducible. 

The Metropolis-Hastings algorithm incorporates a trade-off between prediction and sparsity to 
decide whether to add or discard an edge. Observe that the gES estimator would always estimate 
a precision matrix in which all the elements are nonzero, since all possible individual estimators 
are linearly mixed. However, the Metropolis-Hastings algorithm would lead to a sparse precision 
matrix estimation as in the regression case |16j . 


3 Theoretical Properties 

In this section, we show that under some technical conditions, the risk of the gES estimator is 
bounded by the risk of the best individual estimator in the dictionary plus a low aggregating price. 

Eollowing the notations in Bickel and Levina we define the uniformity class of covariance 
matrices invariant under permutations by 

p 

U{q,d,M) = {e e an <M,J2 for (17) 

f=i 


for 0 < g < 1, where M and 5 are constants. 

Eor any estimator 0 >- 0, we define the risk function 

R{Q) = tr(0S) — logdet(0). (18) 

Note that E is the true covariance matrix. Consider the Kullback-Leibler (KL) divergence 

KL(0) = i?(0) - i?(0) > 0. (19) 


Eor each individual estimator corresponding to m S Al, we assume 0m^ exists and is unique. 
The following proposition relates the KL risk of the aggregation estimator 0gEs lo the KL risks of 
individual estimators 0m^. 


Proposition 3.1. The gES estimator 0gEs'^ Definition 2.2 satisfies the following inequality 


KL(0g)) < mm {kL( 0W) + A log A + tr((0^) - " S))} (20) 

< min |kL(0W) + a log ^ + tr(0«(5^^) - E)) + - E|| • ^(©('e^A (21) 

m^M I 712 ^ & 

It is shown in Vershynin [20] that under some conditions 


ll£A-s|| 



( 22 ) 


/ 2 ) 

thus if we use the empirical covariance matrix as Snf, Proposition 3.1 implies that this would lead 


to a deterioration of convergence rate in high dimensions. Then we choose to use the thresholded 
covariance matrix estimation in Bickel and Levina [T]. 

We consider the following assumption for further analysis of the remainder term. 
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Assumption 3.1. The set of sparsity patterns M satisfies the following condition 

max tr(0W) = Op{p). (23) 

meM 

This assumption ensures a fast convergence rate of the aggregation estimator. It is shown in 
Dahl et al. [3] that the inverse of is a solution to the following problem that maximizes the 
determinant of a symmetric positive definite matrix Z: 


max logdet Z, 
zyo 

s.t. Z,J = V(iJ) e E^U{(i,i),i = l,...,p}. 


(24) 


In general, note that for any two graphs G and G' with sparsity patterns m and m', respectively, 
if G is a subgraph of G', then the trace of 0^) would always be larger than that of 0m^- Thus 
the most dense graph in the dictionary would always be able to achieve the maximum trace among 
all individual estimators. This assumption claims that the diagonal entries of the true precision 
matrix 0 are well estimated by all individual estimators in the dictionary Ai. 

Let m* € M he the sparsity pattern that attains the minimum KL risk in the dictionary A4: 


m* = argmin KL(0(^^). 
m^M. 

The following theorem shows the oracle inequality that the aggregation estimator satisfies. 


(25) 


Theorem 3.1. Suppose Assumption 3.1 hold. Uniformly onU{q,d, M), for sufficiently large K, if 
the thresholding parameter A = and = o(l), then the gES estimator 0gEs^ satisfies 


min KL(0W) = Op 


logp /logpA^^ 


n2 


+ P 


V «2 / 


(26) 


where s* = \m*\i is the number of nonzero off-diagonal elements ofQm*. 

This theorem yields a rate of convergence of the excess risk. In particular, if the set of sparsity 
patterns Ai also includes the true sparsity pattern . Let be the number of nonzero off-diagonal 
elements of 0. It is shown in Zhou et al. [26] that under some technical conditions 

KlOS) = Op . (27) 

and assume that = 0(p), we obtain 

) = Op (28) 

for 0 < <7 < 1. 


Combine it with Theorem 3.1 


4 Experimental Results 

In this section, we provide empirical evidence to illustrate the usefulness of the proposed gES 
estimator and compare it with other state-of-the-art methods in parameter estimation and graph 
recovery using simulated and real datasets. 
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4.1 Numerical simulations 


We generate synthetic datasets with sample size n = 200 or 400, and number of nodes p = 50,100 
or 200. We use the following three models for simulating graphs and precision matrices. Figure 
displays a typical run of the generated graphs of the precision matrices when p = 100. 

(a) “AR”: The off-diagonal (z, j)th element of the adjacency matrix is set to be 1 if |z — j\ = 1 
and 0 otherwise; 

(b) “Hub”: The vertices are evenly partitioned into p/10 disjoint groups. Each group is associated 
with a center vertex i in that group and the off-diagonal (z,j)th element of the adjacency 
matrix is set to 1 if j also belongs to the same group as i and 0 otherwise; 

(c) “Random”: The off-diagonal (z, j)th element of the adjacency matrix is randomly set to be 1 
with certain probability and 0 otherwise. 



(a) AR (b) Hub (c) Random 


Figure 1: An illustration of the three graph patterns with p = 100 nodes. 

We compare the proposed gES estimator with the graphical lasso (glasso) [SI [53] and constrained 
£i minimization estimator (CLIME) |5] with the tuning parameters determined by 10-fold cross- 
validation. We also consider the method of multiple testing of hypotheses about vanishing partial 
correlation coefficients II 0, where the family-wise error rates for incorrect edge inclusion are 
controlled by Bonferroni correction at level a = 0.05. We refer this method as “pcorTest” in this 
paper. For the gES estimator, the full dataset is random partitioned into two subsamples with equal 
size, while for other methods, the graphs are estimated using the full dataset in order to make the 
comparison fair. 

Notice that when p is large, the Metropolis-Hastings algorithm for the gES estimator might 
take a large number of iterations to generate a good random tour over the hypercube with as many 
as p{p — l)/2 dimensions, and thus could become computationally unsuitable. In this case, we 
need to approximate the model space and select a candidate set of edges before the execution of the 
Metropolis-Hastings algorithm. We identify a candidate set of edges by applying some pre-screening 
method to the first subsample Vni ■ Let the ordered list Qp C Sp be the set of selected edges. Then 
our aggregation process is constructed on this subset of edges. Then the set of sparsity patterns is 


given by 


where 


p(p-l)/2 

M= n 

fe=i 



if (ikijk) S Qp, where {ik,jk) is the fcth element of Sp 
otherwise 


(29) 


(30) 


In practice where p is large, the Metropolis-Hastings algorithm introduced in Algorithm |2.1| will be 
applied to this reduced set of sparsity patterns instead of the set of all possible edges {0, 

Many pre-screening methods could be considered here, for instance, the glasso. Note that we prefer 
to choose a small regularization parameter to prevent any true edge being ruled out in this stage. 
An alternative algorithm could be based on the £i regularization paths from the glasso method. Let 
be a set of regularization parameters. For each (3 GB, let TO/j be the sparsity pattern resulted from 
glasso with tuning parameter (3. Then the aggregation method is constructed on the set of sparsity 
patterns M = {mp : (3 G B}. However, in practical applications, the Metropolis-Hastings algorithm 
based on the ii regularization paths always concentrations on a single model after convergence, and 
generally does not perform well. 

For any estimator 0 0, we use the following criteria for the comparisons. 

(1) Squared Frobenius norm error: 

Frobenius = ||0 — 0||^; (31) 


(2) Kullback-Leibler loss: 

KL = — logdet(0) -I- trace(0S) — (— logdet(0) +p); (32) 


(3) Precision: the number of correctly estimated edges divided by the total number of edges in 
the estimated graph. Let G = [V, E) be a p-dimensional graph and let G = {V, E) be an 
estimated graph. We define 


Precision = 


\Er]E\^ 
\E\ ' 


(33) 


(4) Recall: the number of correctly estimated edges divided by the total number of edges in the 
true graph, defined by 


Recall = 


\EnE\ 


(5) Fi-score: a weighted average of the precision and recall, defined by 

^ „ Precision • Recall 

-score = 2 


Precision -|- Recall ’ 
where an Fi-score reaches its best value at 1 and worst score at 0. 


(34) 


(35) 
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Table 1: Quantitative comparison of different methods on simulated data; Averaged quantities with 
their standard errors (in parentheses) are reported. 


Model 

Estimator 


{n 

.p) = (200.50) 


Frobenius 

KL 

Precision 

Recall 

El-score 

AR 

gES 

6.06 (0.97) 

1.61 (0.29) 

0.73 (0.05) 

0.97 (0.02) 

0.83 (0.03) 


pcorTest 

5.91 (1.48) 

1.73 (0.49) 

0.99 (0.01) 

0.89 (0.04) 

0.94 (0.02) 


glasso 

6.39 (0.82) 

1.51 (0.14) 

0.21 (0.03) 

1.00 (0.00) 

0.35 (0.04) 


CLIME 

5.46 (0.76) 

1.77 (0.25) 

0.15 (0.02) 

1.00 (0.00) 

0.26 (0.03) 

Hub 

gES 

6.29 (1.57) 

1.44 (0.31) 

0.65 (0.06) 

0.97 (0.02) 

0.78 (0.04) 


pcorTest 

26.56 (3.77) 

6.02 (0.67) 

0.98 (0.02) 

0.53 (0.06) 

0.69 (0.05) 


glasso 

16.43 (1.84) 

1.53 (0.13) 

0.18 (0.02) 

1.00 (0.00) 

0.31 (0.03) 


CLIME 

7.34 (0.99) 

3.03 (0.43) 

0.14 (0.01) 

1.00 (0.00) 

0.24 (0.02) 

Random 

gES 

5.94 (1.72) 

1.72 (0.46) 

0.73 (0.05) 

0.91 (0.06) 

0.81 (0.04) 


pcorTest 

8.55 (3.63) 

2.45 (0.93) 

0.99 (0.02) 

0.72 (0.13) 

0.83 (0.09) 


glasso 

7.09 (1.18) 

1.31 (0.16) 

0.20 (0.03) 

1.00 (0.01) 

0.33 (0.04) 


CLIME 

4.59 (0.91) 

1.68 (0.32) 

0.13 (0.02) 

1.00 (0.00) 

0.23 (0.03) 




(n.p) = (200.100) 


Model 

Estimator 

Frobenius 

KL 

Precision 

Recall 

El-score 

AR 

gES 

13.94 (2.58) 

3.10 (0.55) 

0.84 (0.03) 

0.98 (0.02) 

0.90 (0.02) 


pcorTest 

17.48 (2.70) 

4.55 (0.72) 

0.91 (0.03) 

0.88 (0.03) 

0.89 (0.02) 


glasso 

20.89 (1.77) 

3.71 (0.21) 

0.17 (0.02) 

1.00 (0.00) 

0.29 (0.02) 


CLIME 

13.90 (1.47) 

4.79 (0.50) 

0.13 (0.01) 

1.00 (0.00) 

0.23 (0.02) 

Hub 

gES 

14.27 (2.77) 

3.29 (0.58) 

0.77 (0.03) 

0.97 (0.02) 

0.86 (0.02) 


pcorTest 

72.07 (5.98) 

14.92 (0.93) 

0.81 (0.08) 

0.42 (0.04) 

0.55 (0.04) 


glasso 

38.56 (1.67) 

3.61 (0.17) 

0.15 (0.01) 

1.00 (0.00) 

0.27 (0.02) 


CLIME 

17.27 (2.14) 

8.48 (0.89) 

0.12 (0.01) 

1.00 (0.00) 

0.22 (0.02) 

Random 

gES 

12.89 (3.05) 

3.94 (0.82) 

0.85 (0.04) 

0.88 (0.05) 

0.86 (0.04) 


pcorTest 

26.05 (7.44) 

7.15 (1.70) 

0.82 (0.06) 

0.58 (0.10) 

0.68 (0.08) 


glasso 

16.58 (2.24) 

3.04 (0.22) 

0.18 (0.02) 

1.00 (0.00) 

0.30 (0.04) 


CLIME 

10.39 (1.12) 

4.27 (0.50) 

0.11 (0.01) 

1.00 (0.00) 

0.20 (0.02) 




(n.p) = (400.100) 


Model 

Estimator 

Frobenius 

KL 

Precision 

Recall 

El-score 

AR 

gES 

5.54 (1.12) 

1.14 (0.18) 

0.82 (0.03) 

1.00 (0.00) 

0.90 (0.02) 


pcorTest 

2.46 (0.30) 

0.51 (0.06) 

0.99 (0.01) 

1.00 (0.00) 

1.00 (0.00) 


glasso 

12.17 (0.95) 

1.93 (0.10) 

0.17 (0.01) 

1.00 (0.00) 

0.29 (0.02) 


CLIME 

7.55 (0.58) 

2.08 (0.21) 

0.14 (0.01) 

1.00 (0.00) 

0.25 (0.02) 

Hub 

gES 

5.27 (1.12) 

1.16 (0.18) 

0.82 (0.03) 

0.99 (0.01) 

0.90 (0.02) 


pcorTest 

11.20 (2.66) 

2.96 (0.69) 

0.99 (0.01) 

0.90 (0.03) 

0.94 (0.02) 


glasso 

23.76 (1.76) 

1.91 (0.09) 

0.15 (0.02) 

1.00 (0.00) 

0.26 (0.03) 


CLIME 

8.31 (0.78) 

3.20 (0.27) 

0.13 (0.00) 

1.00 (0.00) 

0.23 (0.01) 

Random 

gES 

5.46 (0.85) 

1.98 (0.33) 

0.84 (0.03) 

0.94 (0.04) 

0.88 (0.03) 


pcorTest 

6.53 (1.76) 

2.42 (0.65) 

0.99 (0.01) 

0.82 (0.06) 

0.90 (0.04) 


glasso 

6.48 (0.75) 

1.69 (0.15) 

0.19 (0.02) 

1.00 (0.00) 

0.32 (0.03) 


CLIME 

5.03 (0.54) 

1.95 (0.17) 

0.14 (0.02) 

1.00 (0.00) 

0.24 (0.02) 




(n.p) = (400.200) 


Model 

Estimator 

Frobenius 

KL 

Precision 

Recall 

El-score 

AR 

gES 

14.27 (2.42) 

2.99 (0.53) 

0.89 (0.02) 

0.99 (0.01) 

0.94 (0.01) 


pcorTest 

6.02 (1.13) 

1.36 (0.28) 

0.92 (0.02) 

1.00 (0.00) 

0.96 (0.01) 


glasso 

29.93 (2.37) 

4.62 (0.19) 

0.14 (0.02) 

1.00 (0.00) 

0.25 (0.03) 


CLIME 

16.33 (1.53) 

5.66 (0.61) 

0.10 (0.03) 

1.00 (0.00) 

0.18 (0.05) 

Hub 

gES 

13.49 (2.63) 

2.89 (0.54) 

0.83 (0.04) 

0.99 (0.01) 

0.90 (0.03) 


pcorTest 

44.47 (4.72) 

11.25 (1.11) 

0.89 (0.03) 

0.79 (0.02) 

0.84 (0.02) 


glasso 

54.98 (1.84) 

4.45 (0.16) 

0.13 (0.00) 

1.00 (0.00) 

0.23 (0.01) 


CLIME 

19.65 (1.68) 

8.17 (0.50) 

0.14 (0.00) 

1.00 (0.00) 

0.25 (0.01) 

Random 

gES 

10.43 (1.24) 

3.88 (0.51) 

0.95 (0.02) 

0.90 (0.04) 

0.92 (0.02) 


pcorTest 

13.21 (1.81) 

4.95 (0.74) 

0.87 (0.04) 

0.77 (0.06) 

0.81 (0.04) 


glasso 

11.67 (1.18) 

3.24 (0.20) 

0.16 (0.01) 

1.00 (0.00) 

0.28 (0.02) 


CLIME 

9.77 (0.97) 

3.56 (0.27) 

0.12 (0.01) 

1.00 (0.00) 

0.21 (0.02) 
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Tableshows the simulation results of the quantitative comparison of different methods, where 
we repeat the experiments 50 times and report the averaged values with their standard errors. We 
can see that the gES and CLIME estimators perform better than glasso and pcorTest in term of 
the squared Erobenius norm errors, while gES and glasso are better than CLIME and pcorTest 
regarding the Kullback-Leibler loss. Eor the comparison of graph structure recovery, the gES and 
pcorTest estimators outperform other methods. 

Eigurej^ displays the evolution of the Metropolis-Hastings algorithm, showing evidence that the 
algorithm converges after 4000 iterations. Figure]^ shows a typical realization of the gES method, 
varying the regularization parameter A in the pre-screening glasso, where we can see that the results 
for graphical aggregation are not quite sensitive to the pre-screening method. 



Figure 2: Number of selected edges by the Metropolis-Hastings algorithm as a function of iterations, 
given a typical run of the gES estimator for Hub graph on simulated data with {n,p) = (400,100). 

The computational complexity of glasso [5] which uses a row-by-row block coordinate algorithm 
is roughly O(p^). For the gES estimator, the complexity is roughly 0{Lp^), where L = T + Tq is 
the number of iterations in the Metropolis-Hastings algorithm. Note that L can be dramatically 
reduced if we keep track of and store the individual estimators to avoid duplicated computation of 
precision matrix with the same sparsity pattern. 

4.2 Analysis of microarray data 

In this study, we consider a real-world dataset based on Affymetrix GeneChip microarrays for the 
plant Arabidopsis thaliana [2dj . The sample size is n = 118. A nonparanormal transformation is 
estimated and the expression levels for each chip are replaced by their respective normal scores, 
subject to a Winsorized truncation m- A subset of p = 40 genes from the isoprenoid pathway 
are chosen, and we study the associations among them using the proposed gES estimator and the 
glasso method with tuning parameter determined by cross-validation. 
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■ Frobenius BKL 


■ Precision ■ Recail 



\=0.11, X=0.10, X=0.09, X=0.08, X=0.07, X=0.06, X=0.11, X=0.10, X=0.09, X=0.08, X=0,07, X=0.06, 

E=553 E=675 E=829 E=965 E=1245 E=1585 £=553 E=675 E=829 E=965 E=1245 E=1585 


Figure 3: Typical realization of the gES estimator for Hub graph on simulated data with (n,p) = 
(400,100), varying the regularization parameter A in the pre-screening glasso; Number of selected 
candidate edges (marked as E) are also reported. 


The results show that glasso selects 378 edges and gES selects 111 edges. Among those selected 
edges, 102 edges are identified by both methods. Figurej^shows grids of rectangles with gray scale 
corresponding to the absolute values in the estimated precision matrix for each method. 




(a) glasso 


(b) gES 


Figure 4: Grids of rectangles with gray scale corresponding to the absolute values in the estimated 
precision matrix for each method. 

We also analyze a dataset on microarrays for the gene expression levels [13]. Of the 4238 
genes in immortalized B cells for n = 295 normal individuals, we select p = 318 genes that are 
associated with the phenotypes in genome-wide association studies. We study the estimated graphs 
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obtained by the glasso and gES estimators. The expression levels for each gene are pre-processed 
by log-transformation and standardization. 

The results indicate that glasso selects 12514 edges and the gES estimator selects 1631 edges. 
Among those, 1542 edges are identihed by both methods. Figure [^provides the estimated graphs. 



(a) glasso 



(b) gES 


Figure 5: Estimated graphs for microarray data example for immortalized human B cells. 


5 Conclusion 

In this paper, we propose a new aggregation method for estimating the precision matrix in Gaussian 
graphical models, by considering a convex combination of a suitable set of individual estimators 
with different underlying sparsity patterns. We investigate the risk of this aggregation estimator 
and show by an oracle that it is comparable to the risk of the best estimator based on a single 
graph. Experimental results validate the usefulness of our method in practice. 
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A Appendix 

A.l Proof of Proposition |3.1| , 

Proof. For any m G M and individual estimator 0^ 0, we obtain 

KL(0W) = tr(0^)S) - logdet(0^)) -p + logdet(0). 

Similarly, for the aggregation estimator 0gEs ^ 0 

KL(0 ^es^) = tr(0^^Eg^E) - logdet(0^^Eg^) -p + logdet(0). 
Based on the convexity of KL(-), we obtain 

KL( 0 ('es’)< E ^L'’^^KL( 0 W). 

m£A4 


Let m S At being any sparsity pattern attaining 

min (kL( 0 W) + A log J- + tr( 0 W(^^ 2 i _ ^ 

meM [ n 2 J 

( 12 ) 

Then according to the definition of Vm’ , we obtain 

exp{(n2/2)(logdet(0A) - sif))} 7 r^ 


exp{(n 2 / 2 )(logdet( 0 ^^) - tr(0)^AA))}7rm 

= ^ exp{-(n2/2)[KL(0A) - KL(0W) + tr((0W - 0^)- E))]}, 


where the last equality follows from the fact that 

KL(0W) = tr(0W(E - Sif)) - (logdet(0«) - ^(0^5^)) - p + logdet(0). 

(l 2) 

Note that J2meM ' ~ Th^n the inequality (38) can be written as 


KL( 0 ()^A<KL( 0 A)+ E A'’"^(KL( 0 «)-KL( 0 A)) 

„(A2) 


mGAi 




<KL(e«) + - 


mGAI 
( 1 ) 


Vm 


( 1 , 2 ) ' ^ ' 

mGAi 


,( 1 . 2 ) 


1 '*771 


+ ^ ^,A2)t^(ea)(^2)_5.))_ ^ ^;a,2)tr(0W(^(2)_s)) 


mGA4 


mGAi 


«<■)', j- A w ». 


<KL(e«) + - 


112 


log- 


mGAi 


Vm 


( 1 . 2 ) „2 




,( 1 , 2 ) 


ttiGAI 


+ tr(0AA^Es))- E iiA"Hr(0«A^^)-E)). 


ttiGAI 


(36) 

(37) 

(38) 

(39) 

(40) 

(41) 

(42) 

(43) 

(44) 

(45) 
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According to the fact that 


< 0: and log < 0, (46) 

meM 

we obtain 

KL(0('es^) <KL(0W) + a log ^ + tr(0A(5A - S)) - tr( 0 g)(§( 2 ) _ 5 .))^ ( 47 ) 

It is shown in Wang et al. m that for any px p real symmetric matrix A and any pxp positive 
semidefinite matrix B 


Ap(A)tr(_B) < tr(Ai?) < Ai(A)tr(i?), 


where Ai(A) is the tth largest eigenvalue of A. 
Following this property, we obtain 


- E)) < ||§A - S|| . tr(0('j,i|A 


gES 


Then we write the inequality (|47|) as 


KL(0(1j^A < KL(0W) + A log ^ + tr(0A(A") - S)) + lAA " 


^gES 


n2 


According to the definition of m as in (39), the proposition then follows. 


(48) 


(49) 


(50) 

□ 


A.2 Proof of Theorem 13.11 

Proof. Since m* € A4 is any sparsity pattern attaining min^eAi KL(0mA then 

KL(0 ('e^)) <KL(0 (;:!) + a log ^ + tr(0A(^A - S)) + & - S|l ■ tr(0As^) 


n2 tt, 

2 . 1 


(51) 


= niin KL(0(^)) + _ log — + tr(0^) - E)) + |- E|| • tr(0gA (52) 


meA4 


«2 


The normalization factor H of the prior probability in Definition 2.2 satisfies the following 
inequality 


^=E 

m^M. 
P(p-l)/2 

^ E 

A.-0 

P(p-l)/2 

^ E 

k=0 

< 2 . 


|m|i 


ep{p - 1) 


|m|i 


fp{p-l)/2 

k 


ep(jp - 1) 


ep{p - l)/2 


ep{p - 1) 


(53) 

(54) 

(55) 

(56) 
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Then for any m £ M. we have 


Thus we obtain 



< Imlilog 


/ ep{p- 1) \ 

V l™li V 1 ) 


+ log 2 . 



712 


Op 


^ s* logp ^ 


where s* = |m*|i is the number of nonzero off-diagonal elements of 0 m* • 
Note that 


(57) 


(58) 


tr(0i;ll(§i",)-S))< 115(2)-E|| 

•tr(0^)) 

(59) 

< l|5(2.^ - S|1 

• max tr(0W), 

(60) 

||5(2)-S||.tr(0(i^2)) = ||§^^)-S|! 

■ T, 

(61) 




lA 

1 

■ maxtr(0W). 
ra^M. 

(62) 


Given the assumptions of the theorem, it is shown in Bickel and Levina [T] that 


=Op 



(63) 


Assumption 3.1 provides a 


of (58 1 and (63), the theorem 


stochastic bound for uiajCmeM tr( 0 m^). 
follows. 


Combining it with the results 

□ 
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